
[SPARK-50639][SQL] Improve warning logging in CacheManager #49276

Open. Wants to merge 1 commit into base: master.
Conversation

vrozov
Member

@vrozov vrozov commented Dec 23, 2024

What changes were proposed in this pull request?

The change improves warning logging in the CacheManager by:

  1. Adding logical plan details to the existing warning messages.
  2. Logging a warning when an attempt is made to remove data from the cache but the data is not present.

Why are the changes needed?

The change helps identify incorrect calls to Dataset.persist() and Dataset.unpersist(), as in:

Dataset<Row> dataset = ...
Dataset<Row> dataset1 = dataset.withColumn(...);
Dataset<Row> dataset2 = dataset1.withColumn(...);
dataset.persist(); // OK
dataset1.persist(); // OK
dataset.persist(); // currently logs warning without logical plan details
dataset.unpersist(); // OK
dataset.unpersist(); // no warning
dataset2.unpersist(); // no warning, the actual call should be on dataset1

Does this PR introduce any user-facing change?

Users may see warning messages like:

23.12.2024 19:15:03.840 WARN  [pool-30-thread-1] org.apache.spark.sql.execution.CacheManager - An attempt was made to cache data even though the data had already been cached. Please un-cache data or clear cache first.
Logical plan:
Relation [i#0] JDBCRelation(test_table) [numPartitions=1]

and

23.12.2024 19:15:04.207 WARN  [pool-30-thread-1] org.apache.spark.sql.execution.CacheManager - Data has not been previously cached or it was removed from the cache already.
Logical plan:
Project [i#0, i#0 AS year#6]
+- Relation [i#0] JDBCRelation(test_table) [numPartitions=1]

How was this patch tested?

The change modifies warning log messages.

Was this patch authored or co-authored using generative AI tooling?

No.

@vrozov
Member Author

vrozov commented Dec 31, 2024

@hvanhovell please review

@vrozov
Member Author

vrozov commented Jan 7, 2025

@gengliangwang can you take a look

    val shouldRemove: LogicalPlan => Boolean =
      if (cascade) {
        _.exists(isMatchedPlan)
      } else {
        isMatchedPlan
      }
-   val plansToUncache = cachedData.filter(cd => shouldRemove(cd.plan))
+   var plansToUncache: IndexedSeq[CachedData] = null
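For context, the shouldRemove predicate in the hunk above distinguishes cascading from non-cascading uncache: with cascade, any cached plan whose tree contains a matching subplan is removed; without it, only plans that match directly. A rough Python model (not Spark code; the Plan class and helper names are illustrative only):

```python
# Toy model (not Spark code) of the cascade / non-cascade predicate:
# with cascade=True, a cached plan is removed if ANY node in its tree matches;
# with cascade=False, only if the root plan itself matches.
class Plan:
    def __init__(self, name, *children):
        self.name = name
        self.children = list(children)

    def exists(self, pred):
        """True if pred holds for this node or any descendant."""
        return pred(self) or any(c.exists(pred) for c in self.children)

def should_remove(plan, is_matched, cascade):
    return plan.exists(is_matched) if cascade else is_matched(plan)

def matches_relation(p):
    return p.name == "Relation"

relation = Plan("Relation")
project = Plan("Project", relation)  # a Project built on top of the Relation

assert should_remove(project, matches_relation, cascade=True)       # subtree matches
assert not should_remove(project, matches_relation, cascade=False)  # root differs
```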
Member

@vrozov why do we need the code change here if this PR is to improve the logging?

Member Author

@gengliangwang The logging relies on the plansToUncache.nonEmpty check added on line 288 to correctly log warnings, so the change is necessary to prevent a race condition (cachedData should be accessed under synchronized).

@vrozov
Member Author

vrozov commented Jan 10, 2025

@gengliangwang please check my reply

@gengliangwang
Member

@vrozov sorry for the late reply.
Before we move forward, I think the changes in this PR are already covered in #45990. If you set spark.sql.dataframeCache.logLevel as WARN, you will see similar logs for cache/uncache.
cc @anchovYu

@vrozov
Member Author

vrozov commented Jan 10, 2025

The change in #45990 provides troubleshooting options for the CacheManager and is useful for debugging memory leaks in it. The problem is that it is not enabled by default (the default level is TRACE) and it produces a large amount of logging. This PR enables early warning notifications, and users can then further troubleshoot CacheManager issues by setting spark.sql.dataframeCache.logLevel.

@vrozov
Member Author

vrozov commented Jan 14, 2025

@gengliangwang, @anchovYu please check my reply.

@gengliangwang
Member

gengliangwang commented Jan 14, 2025

@vrozov the changes in this PR overlap with PR #45990. How about we simply change the log level from TRACE to INFO?

@vrozov
Member Author

vrozov commented Jan 14, 2025

@gengliangwang My understanding is that the log level was intentionally set to TRACE and should not be enabled by default. Please see the comment on #45990:

Because every query applies cache, this log could be huge and should be only turned on during some debugging process, and should not enabled by default in production.

Note that the warnings on lines 129 and 145 coexist with the changes from #45990 and provide early problem notification.

This PR originates from a real issue where I spent a large amount of time first isolating a memory leak to the CacheManager and then tracing it to an unpersist() call on the wrong Dataset. Had the warning been present in the first place, the problem would have been much easier to identify.

@gengliangwang
Member

Note that the warnings on lines 129 and 145 coexist with the changes from #45990 and provide early problem notification.

Can you provide more details? I think they are similar. Please check the method calls of CacheManager.logCacheOperation.

@vrozov
Member Author

vrozov commented Jan 15, 2025

@gengliangwang Please check lines 129 and 145. Those are pre-existing warnings (they existed prior to #45990 and were not removed as part of it), and they use logWarning(), not CacheManager.logCacheOperation() (which depends on the spark.sql.dataframeCache.logLevel setting and is off by default). I think this is the right approach: the changes in #45990 provide a means to further troubleshoot the CacheManager (without code debugging), while the warnings in this PR provide early problem notification.

I think they are similar.

They are related but serve different purposes. Entries that use logWarning() are enabled by default (including in production), while logCacheOperation() entries are disabled by default and should only be enabled to debug caching issues.

@vrozov
Member Author

vrozov commented Jan 16, 2025

@gengliangwang Please check my reply.

To clarify why I think #45990 and #49276 are related but do not overlap: the changes in #45990 log (trace by default) messages when the cache is modified (an item is added or removed), while the changes in #49276 log warning messages when the cache is expected to be modified by the user but is not (an item is not added or not removed), along with details about the Dataset used in the call to persist() or unpersist().
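The distinction can be illustrated with a toy model (plain Python, not Spark code; the class name and the messages are only stand-ins for the real CacheManager): modification logs fire on every successful cache change, while the warnings proposed here fire only when a requested change turns out to be a no-op.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("CacheManager")

class ToyCacheManager:
    """Toy model (not Spark code) of the warning behavior proposed in this PR:
    persist() on an already-cached plan and unpersist() on a missing plan
    each log a warning that includes the logical plan."""

    def __init__(self):
        self._cache = {}  # plan string -> cached "data"

    def persist(self, plan, data):
        if plan in self._cache:
            log.warning("Data was already cached.\nLogical plan:\n%s", plan)
            return False
        self._cache[plan] = data
        return True

    def unpersist(self, plan):
        if plan not in self._cache:
            log.warning("Data has not been previously cached or it was "
                        "removed from the cache already.\nLogical plan:\n%s", plan)
            return False
        del self._cache[plan]
        return True

cm = ToyCacheManager()
assert cm.persist("Relation [i#0]", object())
assert not cm.persist("Relation [i#0]", object())  # warns, with the plan included
assert cm.unpersist("Relation [i#0]")
assert not cm.unpersist("Relation [i#0]")          # warns: nothing to remove
```

In this sketch, the two False returns correspond to the two no-op situations that the PR's warnings cover.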

@vrozov
Member Author

vrozov commented Jan 21, 2025

@gengliangwang Please check my reply.

@@ -126,7 +126,9 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
    if (storageLevel == StorageLevel.NONE) {
      // Do nothing for StorageLevel.NONE since it will not actually cache any data.
    } else if (lookupCachedDataInternal(normalizedPlan).nonEmpty) {
-     logWarning("Asked to cache already cached data.")
+     logWarning(log"An attempt was made to cache data even though the data had already been " +
Member

@vrozov so in the method call of lookupCachedDataInternal, it will output

CacheManager.logCacheOperation(log"Dataframe cache hit for input plan:" +
        log"\n${MDC(QUERY_PLAN, plan)} matched with cache entry:" +
        log"${MDC(DATAFRAME_CACHE_ENTRY, result.get)}")

while the change here suggests that users un-cache the data, instead of noting that the cache was hit.

Member

The log from #45990 is more accurate, right?

Member Author

No, it is not. The cache hit is irrelevant in this case: the caller (user) does not expect the entry to be present in the cache, and the user explicitly calls persist(), while the cache lookup is an internal operation not initiated by the caller. The warning message in #49276 means that the user either does not need to call persist() (and should remove it) or there is a missing or wrong call to unpersist().

Member Author

@gengliangwang Please see my response.

Member Author

@gengliangwang Please see my response.

Member

@vrozov can you explain why we need to un-cache data or clear the cache? If the plan is already cached, then we don't need to cache it again.

Member Author

@gengliangwang it sounds like we are going in circles here. The warning message is logged when persist() is called on a Dataset that is already present in the cache, so it indicates a bug in the code that calls persist(). Either there is a missing call to unpersist() on the Dataset (there may be a call to unpersist() on another Dataset that was never cached, as in the sample I provided in the PR description), or the call is not necessary at all. The warning message is a way to notify the user that there is a problem (bug) in the code; it is up to the user to decide how to fix it. If you have a better suggestion for the warning message wording, please post it here. As far as I can see, the proposed warning message indicates that

  • the data had already been cached, so the call may not be necessary
  • there may be a missing call to unpersist() or to clear the cache

Member

Please un-cache data or clear cache first.\nLogical plan:\n"

Asking users to call unpersist is unnecessary and super confusing.

Member Author
@vrozov vrozov Feb 4, 2025

@gengliangwang

  • I disagree that it is super confusing, and I explained when a call to unpersist() may be necessary.
  • To move this PR forward, I updated the warning message and removed the reference to un-caching data.


@gengliangwang This code change adds value to the warning log, as developers can easily identify which query plan was already persisted.

  • Before the change: the warning log only showed "Asked to cache already cached data." Developers could not identify which query plan was already cached from the warning message. For a large project, the warning adds little value because there may be too many DataFrames.

  • After the change: the warning log shows which query plan was already cached, so developers can easily check their code to identify the unnecessary cache/persist for a specific DataFrame.

Comment on lines -209 to +215
-   uncacheByCondition(spark, _.sameResult(plan), cascade, blocking)
+   if (!uncacheByCondition(spark, _.sameResult(plan), cascade, blocking)) {
+     logWarning(log"Data has not been previously cached or it was removed from the " +
+       log"cache already.\nLogical plan:\n${MDC(QUERY_PLAN, plan)}")
+   }


@gengliangwang This log is to warn developers that they are trying to unpersist a query plan that has not been previously cached, and to show the related query plan details.

For example, this sample PySpark code tries to unpersist a redefined DataFrame. This leaves the query plan of the original cached DataFrame in the CacheManager. If this happens in a for loop or in a Spark Structured Streaming foreachBatch, driver memory will constantly increase and lead to a memory issue.

    df = spark.createDataFrame(data, ["name", "age", "city"])
    df.persist()
    df.show()
    df = df.withColumn("NAME", upper(col("name")))
    df.show()
    df.unpersist()

The proposed change helps developers easily see that they are trying to unpersist a query plan that has not been previously cached. They can then review their code to confirm whether they are unpersisting the wrong DataFrame.
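The leak pattern above can be shown with a minimal model (plain Python, not Spark; the plan strings stand in for logical plans): because reassigning df changes its logical plan, unpersist() looks up a key that was never cached and silently misses, so the entry cached in each iteration is never removed.

```python
# Toy illustration (not Spark code) of the leak pattern described above:
# the cached entry is keyed by the original plan, but unpersist() is called
# with the transformed plan, so the entry is never removed and the cache grows.
cache = {}

def persist(plan):
    cache.setdefault(plan, object())

def unpersist(plan):
    cache.pop(plan, None)  # silently a no-op when the plan was never cached

for batch in range(3):
    plan = f"Relation df#{batch}"
    persist(plan)                      # caches the original plan
    transformed = plan + " +Project"   # df = df.withColumn(...) changes the plan
    unpersist(transformed)             # misses: this plan was never cached

assert len(cache) == 3  # one stale entry per iteration: a steady leak
```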

@yangguoaws yangguoaws left a comment

@vrozov @gengliangwang I added comments for the two warning log changes and illustrated their added value for developers. Please let me know if any changes are needed to move this PR forward.

@vrozov
Member Author

vrozov commented Feb 10, 2025

@gengliangwang Please review

@vrozov
Member Author

vrozov commented Feb 14, 2025

@gengliangwang Please review

@@ -126,7 +126,9 @@ class CacheManager extends Logging with AdaptiveSparkPlanHelper {
    if (storageLevel == StorageLevel.NONE) {
      // Do nothing for StorageLevel.NONE since it will not actually cache any data.
    } else if (lookupCachedDataInternal(normalizedPlan).nonEmpty) {
-     logWarning("Asked to cache already cached data.")
+     logWarning(log"An attempt was made to cache data even though the data had already been " +
Member

Again, this is very similar to the changes in https://github.com/apache/spark/pull/45990/files#diff-88635a13b65f19dcc80b865d903b498b8328607f96c088402a8ebdbb857eedf9R303. I don't think we should have two duplicated logs with different log levels.

Member Author

@gengliangwang The log entry is not added as part of this PR, so I don't follow your concern.

@vrozov
Member Author

vrozov commented Feb 24, 2025

@gengliangwang Please see my response to your comment

@vrozov
Member Author

vrozov commented Feb 26, 2025

@gengliangwang ^^^

@vrozov
Member Author

vrozov commented Feb 28, 2025

@gengliangwang ?

@vrozov
Member Author

vrozov commented Mar 3, 2025

@hvanhovell @dongjoon-hyun Please take a look.

@vrozov
Member Author

vrozov commented Mar 6, 2025

@hvanhovell ? @dongjoon-hyun ?


3 participants